AITopics | synthetic data

synthetic data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Adaptive Kernel Density Estimation with Pre-training

Zhang, Ruitong, Deng, Ke

arXiv.org Machine Learning | May-14-2026

Density estimation in high-dimensional settings is an important and challenging statistical problem. Traditional methods based on kernel smoothing are inefficient in high dimensions due to the difficulties in specifying appropriate location-adaptive kernels. In this work, we introduce pre-training, a key idea behind many cutting-edge AI technologies, to the context of non-parametric density estimation. By establishing a pre-trained neural network that can recommend an appropriate location-adaptive kernel for each sample point, efficient density estimation with adaptive kernels is achieved in high dimensions. A wide range of numerical experiments show that this strategy is highly effective for improving density-estimation accuracy when the target distribution is close to the distribution family for pre-training. When the target distribution is substantially different from the pre-training distribution family, the benefit from the proposed pre-training strategy may be diluted, but can be reactivated by an additional fine-tuning procedure.
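To make "location-adaptive kernel for each sample point" concrete, here is a minimal one-dimensional sketch of a sample-point adaptive kernel density estimate. It is not the authors' method: the pre-trained network is replaced by a hypothetical stand-in, `knn_bandwidths`, which assigns each sample the distance to its k-th nearest neighbour as its bandwidth. All function names and the toy data are assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def knn_bandwidths(samples, k=3):
    """Hypothetical stand-in for the pre-trained recommender (an assumption):
    each sample's bandwidth is the distance to its k-th nearest neighbour."""
    bandwidths = []
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        bandwidths.append(max(dists[k - 1], 1e-6))  # guard against zero width
    return bandwidths


def adaptive_kde(x, samples, bandwidths):
    """Sample-point adaptive KDE: average of kernels, one bandwidth per sample."""
    total = 0.0
    for xi, hi in zip(samples, bandwidths):
        total += gaussian_kernel((x - xi) / hi) / hi
    return total / len(samples)


# Toy data: two tight clusters plus two isolated points.
samples = [-2.1, -1.9, -2.0, -1.8, 0.0, 3.0, 3.1, 2.9, 3.2, 6.5]
bandwidths = knn_bandwidths(samples, k=3)

density_at_mode = adaptive_kde(-2.0, samples, bandwidths)
density_in_gap = adaptive_kde(1.5, samples, bandwidths)
```

Samples inside a tight cluster receive narrow kernels, while isolated samples receive wide ones, so the estimate stays sharp near the modes and smooth in sparse regions. In the setting the abstract describes, the bandwidth heuristic would be swapped for the recommendation of the pre-trained network.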

artificial intelligence, density estimation, machine learning, (17 more...)

arXiv.org Machine Learning

2605.13092

Country:

North America > United States (0.14)
Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

Xu, Yichen

arXiv.org Machine Learning | May-12-2026

Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. We evaluate this framework in three settings: ATE preservation under fully generative versus hybrid synthesis, targeted augmentation for practical positivity problems, and synthetic simulation engines for comparing OR, IPW, AIPW, and TMLE before real-data analysis. Across synthetic and ACTG experiments, hybrid synthesis improves causal fidelity relative to fully generative baselines; LLM-based hybrid synthesis is often more faithful than CTGAN for ATE preservation and finite-sample estimator benchmarking.
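The estimators named in the abstract can be illustrated on simulated data. The sketch below is an assumption-laden toy, not the paper's framework: it draws a binary covariate, assigns treatment at random with a known propensity (in the spirit of "randomized synthetic assignment"), and compares outcome regression (OR), inverse probability weighting (IPW), and augmented IPW (AIPW) estimates of the ATE. The data-generating process, sample size, and variable names are hypothetical, and TMLE is not shown.

```python
import random

random.seed(0)

TRUE_ATE = 2.0     # treatment effect built into the simulation (assumption)
PROPENSITY = 0.5   # randomized assignment, so the propensity is known

rows = []
for _ in range(20000):
    x = 1 if random.random() < 0.4 else 0         # binary covariate
    t = 1 if random.random() < PROPENSITY else 0  # randomized treatment
    y = 1.0 + 3.0 * x + TRUE_ATE * t + random.gauss(0.0, 1.0)
    rows.append((x, t, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Outcome regression: estimate E[Y | X, T] by the mean outcome in each cell.
mu = {
    (x, t): mean(y for xx, tt, y in rows if xx == x and tt == t)
    for x in (0, 1)
    for t in (0, 1)
}
ate_or = mean(mu[(x, 1)] - mu[(x, 0)] for x, _, _ in rows)

# Inverse probability weighting with the known propensity.
ate_ipw = mean(
    t * y / PROPENSITY - (1 - t) * y / (1 - PROPENSITY) for _, t, y in rows
)

# Augmented IPW: outcome-model contrast plus weighted residual correction.
ate_aipw = mean(
    mu[(x, 1)] - mu[(x, 0)]
    + t * (y - mu[(x, 1)]) / PROPENSITY
    - (1 - t) * (y - mu[(x, 0)]) / (1 - PROPENSITY)
    for x, t, y in rows
)
```

With randomized assignment all three estimators target the same quantity, so a simulation of this kind can serve as a small benchmark of their finite-sample spread before any real data are analysed.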

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2604.23904

Country:

Asia (0.46)
North America > United States (0.28)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.68)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Active Representation Learning for General Task Space with Applications in Robotics

Neural Information Processing Systems | Apr-30-2026, 10:39:52 GMT

Representation learning based on multi-task pretraining has become a powerful approach in many domains. In particular, task-aware representation learning aims to learn an optimal representation for a specific target task by sampling data from a set of source tasks, while task-agnostic representation learning seeks to learn a universal representation for a class of tasks. In this paper, we propose a general and versatile algorithmic and theoretic framework for active representation learning, where the learner optimally chooses which source tasks to sample from. This framework, along with a tractable meta algorithm, allows most arbitrary target and source task spaces (from discrete to continuous), covers both task-aware and task-agnostic settings, and is compatible with deep representation learning practices. We provide several instantiations under this framework, from bilinear and feature-based nonlinear to general nonlinear cases. In the bilinear case, by leveraging the non-uniform spectrum of the task representation and the calibrated source-target relevance, we prove that the sample complexity to achieve $\varepsilon$-excess risk on target scales with $(k^*)^2 \|v^*\|_2^2 \varepsilon^{-2}$, where $k^*$ is the effective dimension of the target and $\|v^*\|_2^2 \in (0,1]$ represents the connection between source and target space. Compared to the passive one, this can save up to $\frac{1}{d_W}$ of sample complexity, where $d_W$ is the task space dimension. Finally, we demonstrate different instantiations of our meta algorithm in synthetic datasets and robotics problems, from pendulum simulations to real-world drone flight datasets. On average, our algorithms outperform baselines by 20%-70%.

artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States > Illinois (0.28)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective

Neural Information Processing Systems | Apr-30-2026, 04:07:30 GMT

We present a new dataset condensation framework termed Squeeze, Recover and Relabel (SRe$^2$L) that decouples the bilevel optimization of model and synthetic data during training, to handle varying architectures and image resolutions for efficient dataset condensation.

artificial intelligence, deep learning, machine learning, (19 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Generating multivariate time series with COmmon Source CoordInated GAN (COSCI-GAN)

Neural Information Processing Systems | Apr-27-2026, 22:51:12 GMT

Generating multivariate time series is a promising approach for sharing sensitive data in many medical, financial, and IoT applications. A common type of multivariate time series originates from a single source such as the biometric measurements from a medical patient. This leads to complex dynamical patterns between individual time series that are hard to learn by typical generation models such as GANs. There is valuable information in those patterns that machine learning models can use to better classify, predict or perform other downstream tasks. We propose a novel framework that takes time series' common origin into account and favors channel/feature relationships preservation. The two key points of our method are: 1) the individual time series are generated from a common point in latent space and 2) a central discriminator favors the preservation of inter-channel/feature dynamics. We demonstrate empirically that our method helps preserve channel/feature correlations and that our synthetic data performs very well in downstream tasks with medical and financial data.

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (0.88)
Health & Medicine > Government Relations & Public Policy (0.68)
Health & Medicine > Health Care Providers & Services > Reimbursement (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Efficiently Factorizing Boolean Matrices using Proximal Gradient Descent

Neural Information Processing Systems | Apr-25-2026, 00:11:17 GMT

Addressing the interpretability problem of NMF on Boolean data, Boolean Matrix Factorization (BMF) uses Boolean algebra to decompose the input into low-rank Boolean factor matrices. These matrices are highly interpretable and very useful in practice, but they come at the high computational cost of solving an NP-hard combinatorial optimization problem. To reduce the computational burden, we propose to relax BMF continuously using a novel elastic-binary regularizer, from which we derive a proximal gradient algorithm. Through an extensive set of experiments, we demonstrate that our method works well in practice: On synthetic data, we show that it converges quickly, recovers the ground truth precisely, and estimates the simulated rank exactly. On real-world data, we improve upon the state of the art in recall, loss, and runtime, and a case study from the medical domain confirms that our results are easily interpretable and semantically meaningful.

artificial intelligence, machine learning, matrix, (17 more...)

Neural Information Processing Systems

Country:

Europe (0.93)
North America > Canada (0.68)
North America > United States > New York (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Learning to See by Looking at Noise

Neural Information Processing Systems | Apr-24-2026, 19:56:20 GMT

Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper, we go a step further and ask if we can do away with real image datasets entirely, by learning from procedural noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. In particular, we study statistical image models, randomly initialized deep generative models, and procedural graphics models. Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic. We also find that diversity is a key property for learning good representations.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

GlucoSynth: Generating Differentially-Private Synthetic Glucose Traces

Neural Information Processing Systems | Apr-24-2026, 15:31:54 GMT

We focus on the problem of generating high-quality, private synthetic glucose traces, a task generalizable to many other time series sources. Existing methods for time series data synthesis, such as those using Generative Adversarial Networks (GANs), are not able to capture the innate characteristics of glucose data and cannot provide any formal privacy guarantees without severely degrading the utility of the synthetic data. In this paper we present GlucoSynth, a novel privacy-preserving GAN framework to generate synthetic glucose traces. The core intuition behind our approach is to conserve relationships amongst motifs (glucose events) within the traces, in addition to temporal dynamics. Our framework incorporates differential privacy mechanisms to provide strong formal privacy guarantees.

artificial intelligence, machine learning, motif, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.69)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.93)
(2 more...)
